library(xgboost)
source("scripts/error_analysis.r")
source("scripts/training_fcts.r")
Previously we trained a random forest model, which did okay. This time we’re going to fast-forward two decades and use XGBoost (eXtreme Gradient Boosting). While it was surely named by a 12-year-old at heart, this model has also been known as the “GBM (Gradient Boosting Machine) Killer”, since it has performed Xtremely competitively. The details of the algorithm are rather involved, but in essence boosting is another ensemble technique.
While random forest uses bagging (bootstrap aggregation), randomly sampling the input with replacement and averaging the results to reduce overfitting, boosting combines inherently weak learners in sequence, each one correcting its predecessors’ errors, to form a more powerful one. Other algorithms in this family include AdaBoost (adaptive boosting), gradient boosting, CatBoost, etc.
We use XGBoost to skip to the head honcho of boosting. It’s not the newest or most cutting-edge, but it’s a proven leader. And unlike AdaBoost and some others, XGBoost handles multiclass classification and regression natively.
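Since fitting residuals sequentially is the core idea here, a toy sketch may help. This is not XGBoost itself, just a hand-rolled gradient-boosting loop in base R on made-up data, using a brute-force single-split “stump” as the weak learner:

```r
# Toy boosting: each new stump is fit to the residuals of the current
# ensemble, so the training error shrinks additively round by round.
set.seed(42)
x = runif(200, 0, 10)
y = sin(x) + rnorm(200, sd = 0.2)

# Weak learner: best single-split stump, found by brute force over quantile cuts.
fit_stump = function(x, r) {
  cuts = quantile(x, probs = seq(0.05, 0.95, by = 0.05))
  best = NULL
  for (ct in cuts) {
    pred = ifelse(x <= ct, mean(r[x <= ct]), mean(r[x > ct]))
    sse = sum((r - pred)^2)
    if (is.null(best) || sse < best$sse)
      best = list(cut = ct, left = mean(r[x <= ct]), right = mean(r[x > ct]), sse = sse)
  }
  best
}
predict_stump = function(s, x) ifelse(x <= s$cut, s$left, s$right)

eta = 0.3                       # learning rate, analogous to XGBoost's eta
pred = rep(mean(y), length(y))  # start from the global mean
rmse0 = sqrt(mean((y - pred)^2))
for (i in 1:50) {
  s = fit_stump(x, y - pred)    # fit the current residuals
  pred = pred + eta * predict_stump(s, x)
}
rmse50 = sqrt(mean((y - pred)^2))
c(initial = rmse0, boosted = rmse50)  # boosted training RMSE is much lower
```

Each stump alone is a terrible model, but fifty of them, each nudging the prediction toward the remaining residual, recover the sine curve reasonably well.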
Real quick, let’s figure out which feature set to use with a baseline model. I know this isn’t the most foolproof approach, but it’ll save some time for now and give a general indication.
# This stuff won't change between trials.
users = trainfuncs$read.csv('users_clean.csv')
train = trainfuncs$read.csv('reviews_simplified.csv')
train = train[, c("uid", "bid", "stars")]
val = trainfuncs$read.csv('validate_simplified.csv')
test = trainfuncs$read.csv('test_simplified.csv')
# This stuff will.
load_data = function(business_set) {
  business_file <<- paste('business_', business_set, '.csv', sep='')
  business <<- trainfuncs$read.csv(business_file)
  combined <<- create_dmatrix(rbind(train, val), split=TRUE)
  watchlist <<- create_watch()
  dtest <<- create_dmatrix(test, split=FALSE)
}
# Convert data.frame to XGBoost's dense matrix format.
create_dmatrix = function(df, split=TRUE) {
  df = trainfuncs$join_sets(df, split)
  if (split) {
    xgb.DMatrix(as.matrix(df$X), label=df$y)
  } else {
    xgb.DMatrix(as.matrix(df))
  }
}
create_watch = function() {
  dtrain = create_dmatrix(train)
  dval = create_dmatrix(val)
  # Return watchlist.
  list(train=dtrain, eval=dval)
}
Let’s train an XGBoost model for each set and see which one works better preliminarily.
trainfuncs$set_alg_name("xgboost")
default_params = list(eta = 0.3,
                      gamma = 0.5,
                      max_depth = 5,
                      min_child_weight = 5,
                      subsample = 0.8,
                      colsample_bytree = 0.8,
                      lambda = 1, alpha = 0.2,
                      scale_pos_weight = 1,
                      max_delta_step = 0,
                      # "reg:linear" was later renamed "reg:squarederror" in XGBoost.
                      objective = "reg:linear")
params = trainfuncs$get_params()
train_fct = xgb.train
predict_fct = predict
train_args = list(nrounds = 1000,
                  early_stopping_rounds = 25,
                  metrics = list("rmse"),
                  verbose = 1,
                  print_every_n = 10)
# Assumes watchlist has already been constructed.
train_ = function(train_args, params) {
  if (params$final) {
    train = combined
  } else {
    train = watchlist$train
  }
  train_args = c(data=train, watchlist=list(watchlist),
                 train_args)
  params$args = list(params=params$args)
  trainfuncs$train(train_args, params)
}
predict_ = function(model, params=NULL) {
  if (params$final) {
    test = dtest
  } else {
    test = watchlist$eval
  }
  trainfuncs$predict(model, newdata=test,
                     labels=getinfo(test, "label"),
                     params=params)
}
train.predict = function(train_args, params) {
  model = train_(train_args, params)
  predict_(model, params)
}
# Train model for label encoding.
load_data('labels')
model = train_(train_args, params)
## [1] train-rmse:2.527726 eval-rmse:2.629131
## Multiple eval metrics are present. Will use eval_rmse for early stopping.
## Will train until eval_rmse hasn't improved in 25 rounds.
##
## [11] train-rmse:0.998421 eval-rmse:1.045882
## [21] train-rmse:0.990811 eval-rmse:1.045447
## [31] train-rmse:0.987147 eval-rmse:1.046156
## Stopping. Best iteration:
## [14] train-rmse:0.994162 eval-rmse:1.044463
# Train model for one-hot.
load_data('onehot')
pred = train_(train_args, params)
## [1] train-rmse:2.527970 eval-rmse:2.630028
## Multiple eval metrics are present. Will use eval_rmse for early stopping.
## Will train until eval_rmse hasn't improved in 25 rounds.
##
## [11] train-rmse:1.000323 eval-rmse:1.052716
## [21] train-rmse:0.990041 eval-rmse:1.045635
## [31] train-rmse:0.986715 eval-rmse:1.046229
## [41] train-rmse:0.984178 eval-rmse:1.046446
## Stopping. Best iteration:
## [23] train-rmse:0.989308 eval-rmse:1.045371
# Train model for PCA.
load_data('pca')
model = train_(train_args, params)
## [1] train-rmse:2.539426 eval-rmse:2.655265
## Multiple eval metrics are present. Will use eval_rmse for early stopping.
## Will train until eval_rmse hasn't improved in 25 rounds.
##
## [11] train-rmse:0.995908 eval-rmse:1.049882
## [21] train-rmse:0.985818 eval-rmse:1.048946
## [31] train-rmse:0.981104 eval-rmse:1.049808
## Stopping. Best iteration:
## [13] train-rmse:0.992072 eval-rmse:1.047983
# Train model for Word2Vec.
load_data('wv')
model = train_(train_args, params)
## [1] train-rmse:2.539866 eval-rmse:2.656197
## Multiple eval metrics are present. Will use eval_rmse for early stopping.
## Will train until eval_rmse hasn't improved in 25 rounds.
##
## [11] train-rmse:0.996692 eval-rmse:1.050527
## [21] train-rmse:0.986050 eval-rmse:1.047638
## [31] train-rmse:0.980008 eval-rmse:1.049395
## [41] train-rmse:0.975048 eval-rmse:1.050839
## Stopping. Best iteration:
## [18] train-rmse:0.987941 eval-rmse:1.047256
It looks like label and one-hot encoding performed the best. With default parameters, XGBoost is already beating both Random Forest and CatBoost. I did an extensive trial run with XGBoost before and couldn’t get it right, so I guess I have to re-evaluate that. We’ll use one-hot, since it’s more mathematically sound.
We’ll be following the tuning strategy recommended here: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
The first step is to tune min_child_weight, max_depth, and gamma.
| Hyperparameter | Explanation |
|---|---|
| min_child_weight | Minimum sum of weights of observations in a leaf node. |
| max_depth | Maximum depth of the tree. |
| gamma | Minimum loss reduction required to make a split. |
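To make gamma concrete: XGBoost scores each candidate split with a gain formula over the children’s summed gradients (G) and hessians (H), and gamma is subtracted from that gain, so a split is only kept if it reduces loss by more than gamma. A small sketch with made-up gradient statistics:

```r
# Gain of one candidate split (from the XGBoost formulation), minus the
# gamma penalty. GL/HL and GR/HR are the gradient/hessian sums in the
# left and right children; lambda is the L2 term on leaf weights.
split_gain = function(GL, HL, GR, HR, lambda, gamma) {
  0.5 * (GL^2 / (HL + lambda) +
         GR^2 / (HR + lambda) -
         (GL + GR)^2 / (HL + HR + lambda)) - gamma
}
# Hypothetical statistics for the same split under two gamma settings:
g_low  = split_gain(GL = -4, HL = 10, GR = 6, HR = 12, lambda = 1, gamma = 0.5)
g_high = split_gain(GL = -4, HL = 10, GR = 6, HR = 12, lambda = 1, gamma = 15)
g_low > 0    # TRUE: with a small gamma the split is kept
g_high > 0   # FALSE: gamma = 15 prunes the same split away
```

The same arithmetic explains why a huge gamma (like the 30 we stumble onto below) is effectively aggressive pruning.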
load_data('onehot')
best_updates = NULL  # no stored overrides yet; train_combo appends this
# Trains a combination of all arguments.
train_combo = function(...) {
  # Create combination of arguments.
  arg = c(as.list(match.call())[-1], best_updates)
  arg = arg[!duplicated(names(arg))]
  p_grid = do.call(expand.grid, arg)
  for (i in 1:nrow(p_grid)) {
    paste("\n\nTrial ", i, "/", nrow(p_grid), ".\n") %>% cat()
    params = do.call(trainfuncs$get_params,
                     p_grid[i,])
    pred = train.predict(train_args, params)
  }
}
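One subtlety worth noting: because train_combo concatenates the caller’s arguments before the stored bests and then drops duplicated names, explicitly passed values always shadow the remembered ones. A minimal sketch of that mechanic, with hypothetical names:

```r
# Explicit arguments come first, stored bests second; duplicated names are
# dropped, keeping the first (explicit) occurrence.
stored_best = list(gamma = 15, max_depth = 5)
caller_args = list(gamma = c(0, 5, 10))      # caller overrides gamma
arg = c(caller_args, stored_best)
arg = arg[!duplicated(names(arg))]
p_grid = do.call(expand.grid, arg)
p_grid   # 3 rows: gamma varies, max_depth is pinned at 5
```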
train_combo(gamma=c(0,0.25,0.5,0.6,0.7,0.8,0.9,1,2,3,5,7,10,15,20),
            max_depth=seq.int(1,9,2),
            min_child_weight=seq.int(1,9,2))
And the best one is:
best = trainfuncs$best_params(upper=376, verbose=TRUE)
## BEST PARAMS
## alpha: 0.2
## colsample_bytree: 0.8
## eta: 0.3
## gamma: 15
## lambda: 1
## max_delta_step: 0
## max_depth: 5
## min_child_weight: 7
## objective: reg:linear
## scale_pos_weight: 1
## subsample: 0.8
## rmse: 1.04395479529911
But the best doesn’t necessarily tell the whole story. For instance, there is some variation between each run due to randomness in the algorithm, and a point could be best due to noise. It might be informative to look at the overall trends between variables.
p_hist = trainfuncs$read.csv(trainfuncs$grid_file)[1:376,]
library(dplyr)
param_summary = function(p_hist, param) {
  as.data.frame(p_hist %>%
                  group_by_(param) %>%
                  summarize(max=max(rmse),
                            mean=mean(rmse),
                            min=min(rmse)))
}
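As an aside, group_by_ belongs to dplyr’s older standard-evaluation interface, which has since been superseded. The same summary can be produced with base R’s aggregate if that ever breaks; a sketch, assuming p_hist has an rmse column:

```r
# Base-R equivalent of param_summary: group rmse by one parameter column.
param_summary_base = function(p_hist, param) {
  agg = aggregate(p_hist$rmse, by = p_hist[param],
                  FUN = function(v) c(max = max(v), mean = mean(v), min = min(v)))
  # aggregate returns the stats as a matrix column named x; flatten it.
  cbind(agg[param], as.data.frame(agg$x))
}
toy = data.frame(gamma = c(0, 0, 1, 1), rmse = c(1.05, 1.07, 1.02, 1.04))
param_summary_base(toy, "gamma")
#   gamma  max mean  min
# 1     0 1.07 1.06 1.05
# 2     1 1.04 1.03 1.02
```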
library(ggplot2)
plot_param = function(p_hist, param) {
  bands = param_summary(p_hist, param)
  ggplot(bands) +
    geom_ribbon(aes(x=get(param), ymin=min, ymax=max),
                fill='lightsalmon', alpha=0.3) +
    geom_line(aes(get(param), mean), color='indianred2', size=1) +
    ggtitle(paste("Mean RMSE\nby", param)) +
    xlab(param) + ylab("RMSE") + scale_x_continuous(expand = c(0,0)) +
    theme_bw()
}
plot_param(p_hist, "gamma")
plot_param(p_hist, "max_depth")
plot_param(p_hist, "min_child_weight")
Judging from the graphs, max_depth tends to be best around 5, while the others tend to improve score as they increase. We might want to see if higher values would be even better.
train_combo(gamma=c(10,15,20,25,30,35,40,45,50,75,100),
            max_depth=5,
            min_child_weight=seq.int(7,21,2))
Let’s see the fruits of our progress.
p_hist = trainfuncs$read.csv(trainfuncs$grid_file)[1:464,]
plot_param(p_hist, "gamma")
Interesting. The first thing to note is that gamma’s performance might have peaked around 30. The second is that fixing max_depth at 5 drastically reduced the range of error, even on the lower end. So a higher depth, even though it increased error on average, might produce the lowest error on a single run. Part of this might be attributed to overfitting, but we’ll keep this result in mind.
plot_param(p_hist, "min_child_weight")
It seems min_child_weight bottomed out around 11 and remained flat thereafter. Note that a gamma of 30 is actually extremely high. Most online sources note that gamma should only be high when overfitting is an issue; typically, they say, it should be less than 1. We’ll also note this for now, and possibly adjust it later after we’ve tuned regularization. Let’s see the lowest setting.
best = trainfuncs$best_params(upper=464, verbose=TRUE)
## BEST PARAMS
## alpha: 0.2
## colsample_bytree: 0.8
## eta: 0.3
## gamma: 15
## lambda: 1
## max_delta_step: 0
## max_depth: 5
## min_child_weight: 17
## objective: reg:linear
## scale_pos_weight: 1
## subsample: 0.8
## rmse: 1.04367071997167
This is a question of whether we pick the single best trial or the parameters that are best on average. In this case, we seem to have traded a high gamma (a softer form of regularization) for a high min_child_weight. Closer examination of the graphs shows that these are indeed the lowest points, but the mean at these points is much higher. Does that mean these settings are less robust, or are they actually the best? It’s unclear. Let’s observe them in some detail.
param_range = function(p_hist, param) {
  bands = param_summary(p_hist, param)
  rbind(lowest_max=bands[which.min(bands$max),],
        lowest_mean=bands[which.min(bands$mean),],
        lowest_min=bands[which.min(bands$min),])
}
param_range(p_hist, "gamma")
## gamma max mean min
## lowest_max 30 1.044887 1.044376 1.043855
## lowest_mean 30 1.044887 1.044376 1.043855
## lowest_min 15 1.053042 1.046825 1.043671
param_range(p_hist, "min_child_weight")
## min_child_weight max mean min
## lowest_max 19 1.049049 1.045455 1.043931
## lowest_mean 11 1.049205 1.045298 1.043801
## lowest_min 17 1.050880 1.045955 1.043671
So we can see a gamma of 15 achieves a slightly lower minimum but a higher mean and maximum than 30. Similarly, a min_child_weight of 17 reaches a lower minimum but a slightly higher mean and maximum than 11. I’m going to experiment a bit and try the records from the lowest-mean gamma, the lowest-mean min_child_weight, and the lowest-minimum min_child_weight.
possible_updates = data.frame(gamma=c(30, 15, 15),
                              max_depth=c(5, 5, 5),
                              min_child_weight=c(11, 11, 17))
With the hyperparameters for tree structure set, let’s set the parameters for subsampling the observations and features that go into each tree.
| Hyperparameter | Explanation |
|---|---|
| subsample | Fraction of observations to be sampled with each tree. |
| colsample_bytree | Fraction of features to be sampled with each tree. |
We’ll try a range of these with each of the three “bests” we found above.
new_features = list(
  subsample=c(0.6,0.7,0.8,0.9,1.0),
  colsample_bytree=c(0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0))
for (i in 1:3) {
  best_updates = possible_updates[i, ]
  do.call(train_combo, new_features)
}
Let’s plot this for each of our chosen set of parameters.
library(gridExtra)
p_hist = trainfuncs$read.csv(trainfuncs$grid_file)[465:504,]
mean_gamma_sub = plot_param(p_hist, "subsample")
mean_gamma_col = plot_param(p_hist, "colsample_bytree")
mean_gamma_sub_table = param_range(p_hist, "subsample")
mean_gamma_col_table = param_range(p_hist, "colsample_bytree")
p_hist = trainfuncs$read.csv(trainfuncs$grid_file)[505:544,]
mean_child_sub = plot_param(p_hist, "subsample")
mean_child_col = plot_param(p_hist, "colsample_bytree")
mean_child_sub_table = param_range(p_hist, "subsample")
mean_child_col_table = param_range(p_hist, "colsample_bytree")
p_hist = trainfuncs$read.csv(trainfuncs$grid_file)[545:584,]
min_child_sub = plot_param(p_hist, "subsample")
min_child_col = plot_param(p_hist, "colsample_bytree")
min_child_sub_table = param_range(p_hist, "subsample")
min_child_col_table = param_range(p_hist, "colsample_bytree")
grid.arrange(mean_gamma_sub, mean_gamma_col,
             ncol=2, top="Best Mean gamma")
## subsample RMSE range, best mean gamma
## subsample max mean min
## lowest_max 0.7 1.047272 1.045799 1.043859
## lowest_mean 0.8 1.047859 1.045246 1.043859
## lowest_min 0.8 1.047859 1.045246 1.043859
## colsample_bytree RMSE range, best mean gamma
## colsample_bytree max mean min
## lowest_max 0.9 1.045072 1.044590 1.044293
## lowest_mean 1.0 1.045208 1.044444 1.043859
## lowest_min 0.8 1.045137 1.044562 1.043859
grid.arrange(mean_child_sub, mean_child_col,
             ncol=2, top="Best Mean min_child_weight")
## subsample RMSE range, best mean min_child_weight
## subsample max mean min
## lowest_max 1.0 1.045671 1.044649 1.043918
## lowest_mean 1.0 1.045671 1.044649 1.043918
## lowest_min 0.6 1.047860 1.046068 1.043817
## colsample_bytree RMSE range, best mean min_child_weight
## colsample_bytree max mean min
## lowest_max 1 1.044603 1.044213 1.043817
## lowest_mean 1 1.044603 1.044213 1.043817
## lowest_min 1 1.044603 1.044213 1.043817
grid.arrange(min_child_sub, min_child_col,
             ncol=2, top="Best Minimum min_child_weight")
## subsample RMSE range, best minimum min_child_weight
## subsample max mean min
## lowest_max 1.0 1.049486 1.045301 1.043918
## lowest_mean 1.0 1.049486 1.045301 1.043918
## lowest_min 0.7 1.050192 1.045765 1.043797
## colsample_bytree RMSE range, best minimum min_child_weight
## colsample_bytree max mean min
## lowest_max 1.0 1.045208 1.044372 1.043817
## lowest_mean 1.0 1.045208 1.044372 1.043817
## lowest_min 0.9 1.045477 1.044523 1.043797
It’s interesting that while there’s a clear downward trend with colsample_bytree on each set, subsample is more ambiguous. We can also see that the subsample error is extremely volatile (i.e. high range) for high gamma (gamma=30 on best mean gamma), and for high min_child_weight (min_child_weight=17 on best minimum min_child_weight). Since these two have regularization effects, it probably means that no more regularization with subsample would be necessary.
Also, while using the best minimum min_child_weight setting achieves the lowest minimum error, it is on average higher than using the best mean min_child_weight setting.
While it seems exponentially infeasible to test multiple settings on each iteration of parameters, I think it’s useful to interpret the problem in this way. Mean settings are like the expected value of a particular parameter, which in theory generalizes better to the test set. Minimum settings show how much better of a score I can get if I’m lucky. But in order to capitalize on this luck, I would likely need a high number of submissions, which overfits the test set in the process (ideally, if I submit all combinations of results, I get 100%). I think the point of the exercise is to minimize error while minimizing submissions, so using the best mean is the way to go.
Going forward, we’ll use that assumption. It seems using best mean min_child_weight gives the best results, with a minimum not far off from the lowest.
best_updates = possible_updates[2, ]
best_updates$subsample = 1
best_updates$colsample_bytree = 1
Now let’s tune the regularization. It seems variance isn’t particularly a problem, but let’s see.
| Hyperparameter | Explanation |
|---|---|
| lambda | L2 regularization coefficient. |
| alpha | L1 regularization coefficient. |
new_features = list(
  lambda=c(0,1e-4,1e-3,5e-3,1e-2,0.1,0.5,1,1.5,2),
  alpha=c(0,1e-4,1e-3,5e-3,1e-2,0.1,0.5,1,1.5,2))
do.call(train_combo, new_features)
p_hist = trainfuncs$read.csv(trainfuncs$grid_file)[585:684,]
grid.arrange(plot_param(p_hist, "lambda"),
             plot_param(p_hist, "alpha"),
             ncol=2)
We can clearly see here that increasing lambda worsens the error, while increasing alpha improves it. It’s kind of puzzling why this would occur, but it might be caused by the curse of dimensionality. While that doesn’t generally affect trees, the regularization itself is based on the L2 and L1 norms respectively. The L2 norm tends to be influenced more by outliers, and with a large feature set the squared penalty grows faster with the number of dimensions (for |x_i| > 1).
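A quick numeric sanity check of both intuitions (a toy illustration, not tied to this dataset):

```r
# 1) Squared (L2-style) penalties are dominated by a single outlier.
w = c(0.5, 0.5, 0.5, 10)
c(L1 = sum(abs(w)), L2 = sum(w^2))   # L1 = 11.5 vs L2 = 100.75

# 2) For coordinates with |x_i| > 1, the squared penalty outgrows the
#    absolute penalty as dimensionality increases.
gap = function(d) { v = rep(2, d); sum(v^2) - sum(abs(v)) }
c(gap(5), gap(50), gap(500))         # 10, 100, 1000
```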
Anyways, we should keep increasing alpha until we see a “U” shape to find an optimum.
param_range(p_hist, "lambda")
## lambda max mean min
## lowest_max 0.01 1.044449 1.044281 1.043737
## lowest_mean 0.01 1.044449 1.044281 1.043737
## lowest_min 0.10 1.044572 1.044317 1.043701
new_features = list(lambda=c(0,1e-4,1e-3,0.01,0.1,0.2),
                    alpha=c(1.8,2,2.2,2.5,2.8,3))
do.call(train_combo, new_features)
p_hist = trainfuncs$read.csv(trainfuncs$grid_file)[685:720,]
plot_param(p_hist, "alpha")
param_range(p_hist, "alpha")
## alpha max mean min
## lowest_max 2 1.043737 1.043725 1.043701
## lowest_mean 2 1.043737 1.043725 1.043701
## lowest_min 2 1.043737 1.043725 1.043701
Okay, it’s really odd that the runs at alpha = 1.8 and 2.0 converge to nearly identical errors, with alpha = 2.0 going a bit lower, and this holds regardless of lambda. Error diverges as alpha goes higher. A low lambda is still good, but exactly where is ambiguous: the means are very close, as are the ranges.
Let’s take this opportunity to fine tune while revisiting gamma.
new_features = list(subsample=c(0.9,0.95,1),
                    colsample_bytree=c(0.9,0.95,1),
                    lambda=c(0.005,0.01,0.05,0.1,0.15,0.2,0.25),
                    alpha=c(1.7,1.8,1.9,2,2.1))
do.call(train_combo, new_features)
p_hist = trainfuncs$read.csv(trainfuncs$grid_file)[721:1035,]
grid.arrange(plot_param(p_hist, "subsample"),
             plot_param(p_hist, "colsample_bytree"),
             plot_param(p_hist, "lambda"),
             plot_param(p_hist, "alpha"),
             ncol=2, top="Fine tune")
param_range(p_hist, "alpha")
## alpha max mean min
## lowest_max 2.0 1.044838 1.044042 1.043428
## lowest_mean 1.9 1.044963 1.044018 1.043380
## lowest_min 1.8 1.045248 1.044024 1.043114
subsample and colsample_bytree have clear minima at 0.95 and 1.0, respectively. lambda and alpha have quite similar distributions in our fine-tuning range. At this point, toward the end of our grid search, we might want to aim for lower lows at the cost of higher highs, provided that the overall range shifts downward.
best = trainfuncs$best_params(upper=1035, verbose=TRUE)
## BEST PARAMS
## alpha: 1.8
## colsample_bytree: 1
## eta: 0.3
## gamma: 15
## lambda: 0.1
## max_delta_step: 0
## max_depth: 5
## min_child_weight: 11
## objective: reg:linear
## scale_pos_weight: 1
## subsample: 0.95
## rmse: 1.04311425370594
best$rmse = NULL
Those seem like good parameters to use. Before, when we tuned max_depth, we didn’t use much regularization, so the error exploded at higher depths. As a sanity check, let’s test those assumptions again.
best_updates = best
new_features = list(max_depth=c(seq.int(5,11,2)),
                    lambda=c(0.1,0.25,0.4,0.7),
                    alpha=c(1.8,2.0,2.25,2.5))
do.call(train_combo, new_features)
p_hist = trainfuncs$read.csv(trainfuncs$grid_file)[1036:1099,]
grid.arrange(plot_param(p_hist, "max_depth"),
             plot_param(p_hist, "lambda"),
             plot_param(p_hist, "alpha"),
             ncol=2, top="Depth vs Regularization")
We’ve reconfirmed that increasing max_depth makes things worse, although it’s interesting that alpha now keeps improving past 2.0. Regardless, it’s not achieving the lows we reached previously, which perhaps shows the fallacy of relying on a few trial runs without cross-validation. However, cross-validation takes too long for this project. We’ll re-tune gamma for the parameters we’ve settled on.
new_features = list(gamma=c(0,0.5,1,2,5,10,15,20,25,30),
                    max_depth=c(4,5,6),
                    min_child_weight=c(10,11,12))
do.call(train_combo, new_features)
p_hist = trainfuncs$read.csv(trainfuncs$grid_file)[1100:1189,]
grid.arrange(plot_param(p_hist, "gamma"),
             plot_param(p_hist, "max_depth"),
             plot_param(p_hist, "min_child_weight"),
             ncol=2, top="Revisiting Gamma")
It seems gamma is fine where it is.
We left the learning rate for last, since decreasing it can make the model more robust, but it also greatly increases training time.
| Hyperparameter | Explanation |
|---|---|
| eta | Learning rate. |
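The time cost is easy to feel with a toy example: plain gradient descent on f(x) = (x - 3)^2, which again is just an illustration of step size, not XGBoost:

```r
# Count iterations until descent with step size eta gets within tol of the minimum.
steps_to_converge = function(eta, tol = 1e-6) {
  x = 0; n = 0
  while (abs(x - 3) > tol && n < 1e6) {
    x = x - eta * 2 * (x - 3)   # gradient of (x - 3)^2 is 2 * (x - 3)
    n = n + 1
  }
  n
}
c(eta_0.3 = steps_to_converge(0.3),
  eta_0.01 = steps_to_converge(0.01))   # the small eta needs ~40x more steps
```

In boosting the analogue is that a smaller eta needs a much larger nrounds, which is why it’s usually paired with early stopping.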
best_updates = trainfuncs$best_params(upper=1189)
best_updates$rmse = NULL
for (i in 1:3) {
  new_features = list(eta=c(0.001,0.005,0.01,0.05,0.1,0.3,0.5))
  do.call(train_combo, new_features)
}
p_hist = trainfuncs$read.csv(trainfuncs$grid_file)[1190:1196,]
eta_plot = plot_param(p_hist, "eta")
eta_plot + ylim(1.0425, 1.0465)
best_updates = trainfuncs$best_params(upper=1196, verbose=TRUE)
## BEST PARAMS
## alpha: 1.8
## colsample_bytree: 1
## eta: 0.05
## gamma: 15
## lambda: 0.1
## max_delta_step: 0
## max_depth: 5
## min_child_weight: 11
## objective: reg:linear
## scale_pos_weight: 1
## subsample: 0.95
## rmse: 1.04261199696851
best_updates$rmse = NULL
We’ve reduced validation RMSE from roughly 1.0448 to 1.0426. We could fine tune further, perhaps, but the improvements we’d be chasing are increasingly small. Given that there are plenty of stones unturned in terms of different model hypotheses, it wouldn’t be the best use of our time. So we’ll use these parameters to train a final model to predict on the test set.
model = readRDS('models/xgboost_1.8_1_0.05_15_0.1_11_0.95.rds')
imp = xgb.importance(model = model)
xgb.ggplot.deepness(model = model)
xgb.ggplot.importance(importance_matrix = imp, top_n = 20)
xgb.plot.multi.trees(model, features_keep = 5)
set.seed(999)
best_updates$eta = 0.1
new_features = list(final = TRUE)
train_args = list(nrounds = 175,
                  verbose = 1,
                  print_every_n = 10)
train.predict = function(train_args, params) {
  model = train_(train_args, params)
  predict_(model, params)
}
do.call(train_combo, new_features)
The best submission score was 1.05030, which is kind of disappointing. With all that grid search, and a validation score of 1.032, I expected to do a little better. This highlights overfitting, even on the validation set. Future trials should probably combine the train and validation sets, then use K-fold cross-validation, so that we don’t evaluate against the same validation set over and over again.
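For reference, the fold bookkeeping for that suggestion is cheap to set up in base R (a sketch with hypothetical sizes; the xgboost package also ships xgb.cv, which handles this internally):

```r
# Assign each row of the combined train+validation data a fold, then rotate
# which fold is held out; the validation score is the average over the k folds.
set.seed(1)
n = 20; k = 5                              # e.g. 20 rows, 5 folds
folds = sample(rep(1:k, length.out = n))   # balanced random fold assignment
for (i in 1:k) {
  val_idx   = which(folds == i)
  train_idx = which(folds != i)
  # ...fit on train_idx, score on val_idx, then average the k scores...
}
table(folds)   # each fold holds exactly n / k rows here
```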
Even though our best score for this model is comparatively horrid (it did worse than linear regression), we can still build a final model and find, for instance, what it sees as important features.
model = readRDS('models/xgboost_final_1.8_1_0.1_15_0.1_11_0.95.rds')
y = getinfo(combined, 'label')
pred = trainfuncs$predict(model, newdata=combined)
score = analysis$rating_score(pred, y)
score
## accuracy precision recall fscore
## 1 0.9395933 0.9044527 0.07516189 0.1387900
## 2 0.8995901 0.2371190 0.07592216 0.1150174
## 3 0.7235222 0.2877678 0.38119552 0.3279576
## 4 0.5316087 0.3974652 0.74498694 0.5183698
## 5 0.7267871 0.8084740 0.23844544 0.3682746
library(ranger)
rf = readRDS('models/ranger_final_40_306_respec.rds')
business = trainfuncs$read.csv('business_cats.csv')
combined = rbind(train, val)
combined = trainfuncs$join_sets(combined, split=TRUE)
pred = trainfuncs$predict(rf, newdata=combined$X,
                          pred_obj="predictions")
rf_score = analysis$rating_score(pred, combined$y)
rf_score
## accuracy precision recall fscore
## 1 0.9406767 0.9892183 0.08487512 0.1563365
## 2 0.8952119 0.2639740 0.12262562 0.1674599
## 3 0.7710387 0.3960056 0.55933877 0.4637098
## 4 0.6675037 0.5051141 0.85327490 0.6345770
## 5 0.7907133 0.9687711 0.38579970 0.5518377
analysis$score_dist(score, rf_score)
## accuracy precision recall fscore
## 1 -0.001083326 -0.08476564 -0.009713228 -0.01754649
## 2 0.004378236 -0.02685498 -0.046703456 -0.05244256
## 3 -0.047516587 -0.10823780 -0.178143248 -0.13575224
## 4 -0.135895042 -0.10764895 -0.108287961 -0.11620714
## 5 -0.063926234 -0.16029714 -0.147354260 -0.18356304
And this highlights a serious flaw in our error-analysis methodology. My guess is that if the predictions are closer to the actual scores in most instances, but more of them round to the wrong number, it’s possible to have better RMSE but worse accuracy, precision, and recall. In other words, our predictions for each rating have higher variance but also higher kurtosis (picture a normal curve with longer tails but a higher middle).
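A contrived counterexample makes the mechanism explicit: predictions can be closer on average (lower RMSE) yet round to the wrong star more often (lower accuracy).

```r
y = c(4, 4, 4, 4)
pred_a = c(3.45, 3.45, 3.45, 3.45)   # all close to 4, but all round down to 3
pred_b = c(4.00, 4.00, 4.00, 2.00)   # one big miss, three exact hits
rmse = function(p) sqrt(mean((p - y)^2))
acc  = function(p) mean(round(p) == y)
c(rmse_a = rmse(pred_a), rmse_b = rmse(pred_b))   # 0.55 vs 1.00: a wins on RMSE
c(acc_a = acc(pred_a), acc_b = acc(pred_b))       # 0.00 vs 0.75: b wins on accuracy
```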
Therefore, we’ll update our error analysis suite next.